50 research outputs found
PyramidBox: A Context-assisted Single Shot Face Detector
Face detection has been well studied for many years, and one of the remaining
challenges is detecting small, blurred, and partially occluded faces in
uncontrolled environments. This paper proposes a novel context-assisted
single-shot face detector, named \emph{PyramidBox}, to handle this hard face
detection problem. Observing the importance of context, we improve the
utilization of contextual information in three aspects. First, we design a
novel context anchor, which we call PyramidAnchors, to supervise high-level
contextual feature learning with a semi-supervised method. Second, we propose
the Low-level Feature Pyramid Network to combine high-level contextual
semantics with low-level facial features, which also allows PyramidBox to
predict faces of all scales in a single shot. Third, we introduce a
context-sensitive structure that increases the capacity of the prediction
network and improves the final accuracy of the output. In addition, we use
Data-anchor-sampling to augment the training samples across different scales,
which increases the diversity of training data for smaller faces. By
exploiting the value of context, PyramidBox achieves state-of-the-art
performance on two common face detection benchmarks, FDDB and WIDER FACE. Our
code is available in PaddlePaddle at
\url{https://github.com/PaddlePaddle/models/tree/develop/fluid/face_detection}.
Comment: 21 pages, 12 figures
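The Data-anchor-sampling idea mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the anchor scales, the "at most one step up" rule, and the jitter range are all assumptions made for the sketch.

```python
import random

# Anchor scales assumed for illustration; the abstract does not list them.
ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]

def data_anchor_resize_factor(face_size, rng=random):
    """Pick an image resize factor so that one randomly chosen face lands
    near a randomly chosen (usually smaller) anchor scale, increasing the
    share of small faces seen during training."""
    # Index of the anchor scale closest to the original face size.
    nearest = min(range(len(ANCHOR_SCALES)),
                  key=lambda i: abs(ANCHOR_SCALES[i] - face_size))
    # Choose a target anchor at most one step above the nearest one,
    # so most choices shrink the face.
    target = rng.randrange(0, min(nearest + 2, len(ANCHOR_SCALES)))
    # Jitter the exact target size around the chosen anchor scale.
    target_size = rng.uniform(ANCHOR_SCALES[target] / 2.0,
                              ANCHOR_SCALES[target] * 2.0)
    return target_size / face_size
```

Resizing the whole image by the returned factor then rescales the chosen face to the sampled anchor neighborhood, which is what diversifies the small-face distribution.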
Targeting Ultimate Accuracy: Face Recognition via Deep Embedding
Face recognition has been studied for many decades. As opposed to traditional
hand-crafted features such as LBP and HOG, far more sophisticated features can
be learned automatically by deep learning methods in a data-driven way. In
this paper, we propose a two-stage approach that combines a multi-patch deep
CNN with deep metric learning to extract low-dimensional but highly
discriminative features for face verification and recognition. Experiments
show that this method outperforms other state-of-the-art methods on the LFW
dataset, achieving 99.77% pair-wise verification accuracy and significantly
better accuracy under two other, more practical protocols. This paper also
discusses the importance of data size and the number of patches, showing a
clear path to practical, high-performance face recognition systems in the
real world.
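The multi-patch embedding and pair-wise verification described above can be sketched roughly as follows. The patch fusion by concatenation, the L2 normalization, and the cosine threshold are common conventions assumed for illustration; the abstract does not specify them.

```python
import numpy as np

def face_embedding(patch_features):
    """Fuse per-patch CNN features into one embedding by concatenation,
    then L2-normalize so that a dot product is a cosine similarity."""
    v = np.concatenate([np.asarray(p, dtype=float) for p in patch_features])
    return v / np.linalg.norm(v)

def verify(emb_a, emb_b, threshold=0.5):
    """Pair-wise verification: declare 'same identity' iff the cosine
    similarity exceeds a threshold tuned on a validation set."""
    return float(emb_a @ emb_b) >= threshold
```

In practice the embedding would come from the learned metric network and the threshold from the evaluation protocol; the sketch only shows the verification mechanics.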
EATEN: Entity-aware Attention for Single Shot Visual Text Extraction
Extracting entities from images is a crucial part of many OCR applications,
such as entity recognition for cards, invoices, and receipts. Most existing
works employ the classical detection-and-recognition paradigm. This paper
proposes an Entity-aware Attention Text Extraction Network called EATEN, an
end-to-end trainable system that extracts entities without any
post-processing. In the proposed framework, each entity is parsed by its own
entity-aware decoder. Moreover, we introduce a state transition mechanism
that further improves the robustness of entity extraction. Given the absence
of public benchmarks, we construct a dataset of almost 0.6 million images in
three real-world scenarios (train ticket, passport, and business card), which
is publicly available at https://github.com/beacandler/EATEN. To the best of
our knowledge, EATEN is the first single-shot method to extract entities from
images. Extensive experiments on these benchmarks demonstrate the
state-of-the-art performance of EATEN.
Comment: 7 pages
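The per-entity decoding with a state transition can be sketched as a toy control flow. The decoders below are stand-in callables and the fixed entity order is an assumption; in EATEN they are learned attention decoders over the image feature.

```python
# Toy sketch of per-entity decoding with a state transition between decoders.
# Stand-in decoders take (image_feature, state) and return (text, new_state).

def decode_entities(image_feature, entity_decoders):
    """Run one decoder per entity in a fixed order. Each decoder receives the
    state left by the previous one, so finishing one entity transitions the
    model to the next -- no detection/recognition post-processing needed."""
    results = {}
    state = None
    for name, decoder in entity_decoders:
        text, state = decoder(image_feature, state)
        results[name] = text
    return results
```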
PyramidBox++: High Performance Detector for Finding Tiny Face
With the rapid development of deep convolutional neural networks, face
detection has made great progress in recent years. The WIDER FACE dataset, as
a main benchmark, contributes greatly to this area. A large number of methods
have been put forward; among them, PyramidBox designs an effective data
augmentation strategy (Data-anchor-sampling) and a context-based module for
face detection. In this report, we improve each part to further boost
performance, introducing Balanced-data-anchor-sampling, Dual-PyramidAnchors,
and a Dense Context Module. Specifically, Balanced-data-anchor-sampling
obtains a more uniform sampling of faces of different sizes.
Dual-PyramidAnchors facilitate feature learning by introducing a progressive
anchor loss. The Dense Context Module with dense connections not only
enlarges the receptive field but also passes information efficiently.
Integrating these techniques, PyramidBox++ is constructed and achieves
state-of-the-art performance on the WIDER FACE hard set.
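One way to read "more uniform sampling of faces with different sizes" is to sample the target scale bin first and the face second. The sketch below is an assumed interpretation for illustration only, with made-up anchor scales; the report's exact sampling rules may differ.

```python
import random

ANCHOR_SCALES = [16, 32, 64, 128, 256, 512]  # assumed SSD-style scales

def balanced_anchor_sample(face_sizes, rng=random):
    """Pick an anchor bin uniformly at random, then pick the face whose size
    is closest to that bin. Sampling bins (rather than faces) first keeps the
    training distribution over face scales roughly uniform even when the
    image's faces are not."""
    target = ANCHOR_SCALES[rng.randrange(len(ANCHOR_SCALES))]
    face = min(face_sizes, key=lambda s: abs(s - target))
    return target, face
```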
HAMBox: Delving into Online High-quality Anchors Mining for Detecting Outer Faces
Current face detectors use anchors to frame a multi-task learning problem
that combines classification and bounding box regression. Effective anchor
design and anchor matching strategies enable face detectors to localize faces
under large pose and scale variations. However, we observe that more than 80%
of correctly predicted bounding boxes are regressed from unmatched anchors
(anchors whose IoUs with target faces are lower than a threshold) in the
inference phase. This indicates that these unmatched anchors have excellent
regression ability, yet existing methods neglect to learn from them. In this
paper, we propose an Online High-quality Anchor Mining Strategy (HAMBox),
which explicitly compensates outer faces with high-quality anchors. The
proposed HAMBox method can serve as a general strategy for anchor-based
single-stage face detection. Experiments on various datasets, including WIDER
FACE, FDDB, AFW, and PASCAL Face, demonstrate the superiority of the proposed
method. Furthermore, our team won the championship on the Face Detection test
track of the WIDER Face and Pedestrian Challenge 2019. We will release the
code with PaddlePaddle.
Comment: 9 pages, 6 figures. arXiv admin note: text overlap with 1802.09058
by other authors
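The matched/unmatched split that the observation above rests on is just IoU thresholding. A minimal sketch, with the 0.35 threshold chosen only as an example value:

```python
import numpy as np

def iou(anchors, box):
    """IoU between an (N, 4) array of anchors [x1, y1, x2, y2] and one box."""
    x1 = np.maximum(anchors[:, 0], box[0])
    y1 = np.maximum(anchors[:, 1], box[1])
    x2 = np.minimum(anchors[:, 2], box[2])
    y2 = np.minimum(anchors[:, 3], box[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_b = (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area_a + area_b - inter)

def split_matched(anchors, box, thresh=0.35):
    """Split anchors into matched/unmatched by an IoU threshold. HAMBox's
    observation is that many good final boxes are regressed from the
    unmatched set, which standard training ignores."""
    ious = iou(anchors, box)
    return anchors[ious >= thresh], anchors[ious < thresh]
```

HAMBox's mining step then promotes unmatched anchors whose *regressed* boxes overlap the face well, so that they receive a training signal too.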
Learning Generalized Spoof Cues for Face Anti-spoofing
Many existing face anti-spoofing (FAS) methods focus on modeling decision
boundaries for predefined spoof types. However, the diversity of spoof
samples, including unknown ones, hinders effective decision boundary modeling
and leads to weak generalization. In this paper, we reformulate FAS from an
anomaly detection perspective and propose a residual-learning framework to
learn the discriminative live-spoof differences, which we define as spoof
cues. The proposed framework consists of a spoof cue generator and an
auxiliary classifier. The generator minimizes the spoof cues of live samples
while imposing no explicit constraint on those of spoof samples, in order to
generalize well to unseen attacks. In this way, anomaly detection implicitly
guides spoof cue generation, leading to discriminative feature learning. The
auxiliary classifier serves as a spoof cue amplifier, making the spoof cues
more discriminative. Extensive experiments show that the proposed method
consistently outperforms state-of-the-art methods. The code will be publicly
available at https://github.com/vis-var/lgsc-for-fas.
Comment: 16 pages
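The asymmetric constraint on the generator ("minimize live cues, leave spoof cues free") can be written as a one-sided loss. This is a sketch under assumed conventions (label 0 = live, plain L2 penalty); the paper's exact loss terms may differ.

```python
import numpy as np

def spoof_cue_regression_loss(cues, labels):
    """One-sided L2 loss on generated spoof cues: live samples (label 0,
    an assumed convention) have their cues pushed toward zero, while spoof
    samples contribute nothing, leaving their cues unconstrained so the
    model can generalize to unseen attack types."""
    cues = np.asarray(cues, dtype=float)
    labels = np.asarray(labels)
    live = labels == 0
    if not live.any():
        return 0.0
    return float((cues[live] ** 2).mean())
```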
Towards Accurate Scene Text Recognition with Semantic Reasoning Networks
A scene text image contains two levels of content: visual texture and
semantic information. Although scene text recognition methods have made great
progress over the past few years, mining semantic information to assist text
recognition has attracted less attention; only RNN-like structures have been
explored to implicitly model semantic information. However, we observe that
RNN-based methods have obvious shortcomings, such as a time-dependent
decoding manner and one-way serial transmission of semantic context, which
greatly limit both the usefulness of semantic information and computational
efficiency. To mitigate these limitations, we propose a novel end-to-end
trainable framework named the semantic reasoning network (SRN) for accurate
scene text recognition, in which a global semantic reasoning module (GSRM)
captures global semantic context through multi-way parallel transmission.
State-of-the-art results on 7 public benchmarks, covering regular text,
irregular text, and non-Latin long text, verify the effectiveness and
robustness of the proposed method. In addition, SRN is significantly faster
than RNN-based methods, demonstrating its value in practical use.
Comment: Accepted to CVPR202
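The contrast between one-way serial transmission and multi-way parallel transmission can be illustrated with a single self-attention step: every character position gathers context from all others simultaneously, with no time-step recurrence. This is a generic illustration of parallel context flow, not the GSRM's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parallel_semantic_context(char_embeddings):
    """One self-attention step over all character positions at once: each
    position attends to every other position, so semantic context flows in
    parallel (multi-way) rather than serially as in an RNN decoder."""
    d = char_embeddings.shape[1]
    scores = char_embeddings @ char_embeddings.T / np.sqrt(d)
    return softmax(scores, axis=1) @ char_embeddings
```

An RNN decoder would instead need T sequential steps to move context from the first position to the last, which is exactly the time-dependence the abstract criticizes.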
Editing Text in the Wild
In this paper, we are interested in editing text in natural images: replacing
or modifying a word in a source image with another while maintaining its
realistic look. This task is challenging, as the styles of both the
background and the text need to be preserved so that the edited image is
visually indistinguishable from the source image. Specifically, we propose an
end-to-end trainable style retention network (SRNet) that consists of three
modules: a text conversion module, a background inpainting module, and a
fusion module. The text conversion module changes the text content of the
source image into the target text while keeping the original text style. The
background inpainting module erases the original text and fills the text
region with appropriate texture. The fusion module combines the information
from the two former modules and generates the edited text image. To our
knowledge, this work is the first attempt to edit text in natural images at
the word level. Both visual effects and quantitative results on synthetic and
real-world datasets (ICDAR 2013) fully confirm the importance and necessity
of the modular decomposition. We also conduct extensive experiments to
validate the usefulness of our method in various real-world applications,
such as text image synthesis, augmented reality (AR) translation, and
information hiding.
Comment: accepted by ACM MM 201
Learning Global Structure Consistency for Robust Object Tracking
Fast appearance variations and distraction by similar objects are two of the
most challenging problems in visual object tracking. Unlike many existing
trackers that model only the target, in this work we consider the
\emph{transient variations of the whole scene}. The key insight is that the
object correspondence and spatial layout of the whole scene are consistent
(i.e., global structure consistency) across consecutive frames, which helps
disambiguate the target from distractors. Moreover, modeling transient
variations makes it possible to localize the target under fast variations.
Specifically, we propose an effective and efficient short-term model that
learns to exploit global structure consistency over a short time span and can
thus handle fast variations and distractors. Since short-term modeling falls
short of handling occlusion and out-of-view cases, we adopt the long-short
term paradigm and use a long-term model that corrects the short-term model
when it drifts away from the target or the target is absent. These two
components are carefully combined to balance stability and plasticity during
tracking. We empirically verify that the proposed tracker can tackle both
challenging scenarios and validate it on large-scale benchmarks. Remarkably,
our tracker improves state-of-the-art performance on VOT2018 from 0.440 to
0.460, on GOT-10k from 0.611 to 0.640, and on NFS from 0.619 to 0.629.
Comment: Accepted by ACM MM 202
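The long-short term paradigm amounts to a simple control loop: run the plastic short-term model every frame, and let the stable long-term model re-detect and correct it when confidence drops. The sketch below uses stand-in callables and an example confidence threshold; it is not the authors' actual models.

```python
# Toy sketch of the long-/short-term paradigm: a short-term tracker runs on
# every frame, and a long-term model corrects it whenever its confidence in
# the short-term estimate drops (likely drift or target absence).

def track(frames, short_term, long_term, conf_thresh=0.5):
    boxes = []
    for frame in frames:
        box, conf = short_term(frame)
        if conf < conf_thresh:      # drift or target not present
            box = long_term(frame)  # long-term model re-detects and corrects
        boxes.append(box)
    return boxes
```

The threshold is where the stability/plasticity trade-off lives: a low threshold trusts the fast short-term model more, a high one falls back to the long-term model more often.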
ACFNet: Attentional Class Feature Network for Semantic Segmentation
Recent works have made great progress in semantic segmentation by exploiting
richer context, mostly designed from a spatial perspective. In contrast, we
present the concept of the class center, which captures global context from a
categorical perspective. This class-level context describes the overall
representation of each class in an image. We further propose a novel module,
the Attentional Class Feature (ACF) module, which calculates class centers
and adaptively combines them for each pixel. Based on the ACF module, we
introduce a coarse-to-fine segmentation network, the Attentional Class
Feature Network (ACFNet), which can be composed of an ACF module and any
off-the-shelf segmentation network (base network). In this paper, we use two
types of base networks to evaluate the effectiveness of ACFNet. We achieve a
new state-of-the-art performance of 81.85% mIoU on the Cityscapes dataset
using only finely annotated data for training.
Comment: Accepted to ICCV 201
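The class-center computation can be sketched as a probability-weighted average of pixel features per class. This is a minimal interpretation of the idea for illustration; the shapes and the soft weighting by a coarse segmentation are assumptions, not the exact ACF module.

```python
import numpy as np

def class_centers(features, probs):
    """Compute class centers from a coarse segmentation. `features` is an
    (N, C) array of per-pixel features and `probs` an (N, K) array of
    per-pixel class probabilities; each center is the probability-weighted
    mean feature of one class (the class's overall representation)."""
    weights = probs / (probs.sum(axis=0, keepdims=True) + 1e-12)  # (N, K)
    return weights.T @ features  # (K, C): one center per class
```

The attentional part of the module would then recombine these K centers per pixel; here only the categorical pooling step is shown.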